RGB-D Camera Based Tracking System and Method Thereof

ABSTRACT

A visual SLAM system comprises a plurality of keyframes including a keyframe, a current keyframe, and a previous keyframe, a dual dense visual odometry configured to provide a pairwise transformation estimate between two of the plurality of keyframes, a frame generator configured to create a keyframe graph, a loop constraint evaluator configured to add a constraint to the keyframe graph, and a graph optimizer configured to produce a map with trajectory.

CROSS-REFERENCE TO RELATED APPLICATION

This application claims priority to a U.S. provisional patent application Ser. No. 62/354,251, filed Jun. 24, 2016, the contents of which are incorporated herein by reference as if fully enclosed herein.

FIELD

This disclosure relates generally to tracking systems and, more particularly, to an RGB-D camera based tracking system and method thereof.

BACKGROUND

Unless otherwise indicated herein, the materials described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.

SUMMARY

A summary of certain embodiments disclosed herein is set forth below. It should be understood that these aspects are presented merely to provide the reader with a brief summary of these certain embodiments and that these aspects are not intended to limit the scope of this disclosure. Indeed, this disclosure may encompass a variety of aspects that may not be set forth below.

Embodiments of the disclosure relate to a method for computing visual Simultaneous Localization and Mapping (SLAM). The method comprises generating, by a visual odometry module, a local odometry estimate; generating, by a keyframe generator, keyframes; creating a keyframe graph; adding a constraint to the keyframe graph using a loop constraint evaluator; and optimizing the keyframe graph with trajectory. The method further comprises generating a new keyframe between a keyframe and a current frame before generating a local odometry estimate. Adding the constraint to the keyframe graph using the loop constraint evaluator is based on a loop closure, wherein the loop closure is the return to previously visited locations. The method further comprises adjusting a pose graph based on edge weights of different constraints in the keyframe graph after optimization.

According to another aspect of the disclosure, a method of applying a probabilistic sensor model for a dense visual odometry comprises generating, by a keyframe generator, keyframes; creating a keyframe graph; adding a constraint to the keyframe graph using a loop constraint evaluator; and optimizing the keyframe graph with trajectory. The method further comprises generating a new keyframe between a keyframe and a current frame before generating a local odometry estimate. Adding the constraint to the keyframe graph using the loop constraint evaluator is based on a loop closure, wherein the loop closure is the return to previously visited locations. The method further comprises adjusting a pose graph based on edge weights of different constraints in the keyframe graph after optimization.

According to another aspect of the disclosure, a method of using a t-distribution for photometric errors and a probabilistic sensor model for geometric errors comprises computing:

${\hat{\xi}}_{Hybrid} = {\underset{\xi}{\arg\min}{\sum\limits_{i = 1}^{n}{r_{i}^{\top}W_{i}^{1/2}\Sigma^{- 1}W_{i}^{1/2}r_{i}}}}$

According to another aspect of the disclosure, a visual SLAM system comprises a plurality of keyframes including a keyframe, a current keyframe, and a previous keyframe, a dual dense visual odometry configured to provide a pairwise transformation estimate between two of the plurality of keyframes, a frame generator configured to create a keyframe graph, a loop constraint evaluator configured to add a constraint to the keyframe graph, and a graph optimizer configured to produce a map with trajectory.

BRIEF DESCRIPTION OF THE DRAWINGS

These and other features, aspects, and advantages of this disclosure will become better understood when the following detailed description of certain exemplary embodiments is read with reference to the accompanying drawings in which like characters represent like parts throughout the drawings, wherein:

FIG. 1 is a block diagram illustrating a visual SLAM system;

FIG. 2 is a block diagram illustrating the structure of an example keyframe graph and loop constraint evaluator;

FIG. 3 illustrates an RGB-D camera sensor model;

FIG. 4 is a block diagram of an uncertainty propagation; and

FIG. 5 illustrates an example of a map generated by a σ-DVO SLAM system.

DETAILED DESCRIPTION

The following description is presented to enable any person skilled in the art to make and use the described embodiments, and is provided in the context of a particular application and its requirements. Various modifications to the described embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the described embodiments. Thus, the described embodiments are not limited to the embodiments shown, but are to be accorded the widest scope consistent with the principles and features disclosed herein.

Various operations may be described as multiple discrete actions or operations in turn, in a manner that is most helpful in understanding the claimed subject matter. However, the order of description should not be construed as to imply that these operations are necessarily order dependent. In particular, these operations may not be performed in the order of presentation. Operations described may be performed in a different order than the described embodiment. Various additional operations may be performed and/or described operations may be omitted in additional embodiments.

FIG. 1 is a block diagram illustrating a visual Simultaneous Localization and Mapping (SLAM) system 100 divided into a front-end 100a and a back-end 100b. At the front-end 100a, the system 100 uses a visual odometry approach that makes full use of all pixel information from an RGB-D camera to generate a local transformation estimate 112. That is, dense visual odometry 108 or 110 provides a pairwise transformation estimate between two of the image frames 102, 104, 106. As illustrated, a first pairwise transformation estimate is performed between keyframe 102 and current frame 104 using dense visual odometry 108. A second pairwise transformation estimate is performed between current frame 104 and previous frame 106 using dense visual odometry 110. A keyframe generator 114 is used to generate a keyframe V_(k) based on the quality of the odometry estimate. At the back-end 100b of the system 100, a keyframe graph G⊂{V_(k)} 116 is created using the keyframes from the keyframe generator 114. At a loop constraint evaluator 118, constraints based on the return to previously visited locations, i.e., loop closure, are added to the keyframe graph to improve its connectivity. Graph optimizer 120 then optimizes the final graph with constraints to produce an optimized map with trajectory 122. More details on the keyframe graph 116 and the loop constraint evaluator 118 are described below. A probabilistic sensor model is used in the front-end 100a, and keyframe generation 114, loop constraint detection 118, and graph optimization 120 are performed in the back-end 100b.

FIG. 2 is a block diagram illustrating the structure of an example keyframe graph 200 comprising a back-end graph optimization 202 and a local neighborhood 204. In the back-end graph optimization 202, the loop constraints L_(Ki,Kj), combined with the odometry constraints weighted by I_(Ki,Kj), are optimized. The recent keyframe K₁ and the frames tracked with respect to K₁, (f₁, . . . , f_(n)), are included in the local neighborhood 204. The keyframe K₁ and the tracked frames (f₁, . . . , f_(n)) are determined based on the ratio of entropies H_(K1,f1)/H_(K1,fn). When the current frame does not contain sufficient information to track a new frame, a new keyframe is generated by using the entropy of a camera pose estimate. A new keyframe is generated when the estimated entropy between the keyframe and the current frame falls below a threshold normalized by the largest estimated entropy in the local neighborhood 204. The largest estimated entropy is assumed to be the one between the keyframe and the first frame. An additional keyframe generation strategy based on the curve estimate of the camera trajectory is also proposed. The curve estimate ρ_(i,k) between frames i and k is defined as the ratio of the sum of the translations between consecutive frames (δ_(i,i−1)) in the local neighborhood N to the translation between the keyframe and the latest frame (δ_(i,k)).

$\begin{matrix} {\rho = \frac{\sum\limits_{i \in N}^{k}\delta_{i,{i - 1}}}{\delta_{i,k}}} & {{equation}\mspace{14mu} (1)} \end{matrix}$
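The two keyframe generation criteria described above can be sketched as follows. This is a minimal illustration and not the claimed implementation: the function names, the representation of camera positions as 3-tuples, and the threshold value are assumptions introduced here, and the entropies are taken as given inputs rather than computed from a pose covariance.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def entropy_ratio_below_threshold(h_current: float, h_first: float,
                                  threshold: float) -> bool:
    """True when H(K, f_n) / H(K, f_1) falls below the (assumed) threshold."""
    return (h_current / h_first) < threshold


def curve_estimate(positions: Sequence[Vec3]) -> float:
    """Equation (1): summed consecutive translations over the direct translation.

    positions[0] is the keyframe position and positions[-1] the latest frame.
    """
    path = sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))
    chord = math.dist(positions[0], positions[-1])
    if chord == 0.0:
        # No net translation: neutral value if nothing moved, else unbounded.
        return 1.0 if path == 0.0 else math.inf
    return path / chord
```

For a camera moving along a straight line the curve estimate equals 1, and it grows as the trajectory bends, so a larger value can be read as a hint that a new keyframe may be useful.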

The return to a previously visited location, called loop closure, helps identify additional constraints for the graph at the loop constraint evaluator 118, as illustrated in FIG. 2. After optimization, the pose graph is adjusted based on the edge weights of the different constraints in the graph. An erroneous loop constraint can sometimes lead to a poorly optimized final trajectory. Extending previous loop constraint generation methods, two additional techniques can be used to reduce the impact of wrong loop constraints. First, the loop closure constraints are weighted based on the inverse square of the metric distance between the keyframes that form the loop closure. This is based on the intuition that a loop constraint between far-apart frames is prone to a larger error than one between frames close to one another. Second, occlusion filtering is performed to remove false loop closure constraints. The depth image provides geometry information that can be used to perform occlusion filtering between two keyframes. The standard deviation of the sensor model uncertainty of a depth point provides a bound on the maximum possible depth shift, given by the following equation:

$\begin{matrix} {{\eta \left( Z_{i} \right)} = {\frac{q_{pix}{bf}}{2}\left\lbrack {\frac{1}{{Rnd}\left( {\frac{q_{pix}{bf}}{Z_{i}} - 0.5} \right)} - \frac{1}{{Rnd}\left( {\frac{q_{pix}{bf}}{Z_{i}} + 0.5} \right)}} \right\rbrack}} & {{equation}\mspace{14mu} (2)} \end{matrix}$

All points that violate this bound are considered occluded.
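The two safeguards against wrong loop constraints can be illustrated with the short sketch below. It is written under stated assumptions: the helper names are hypothetical, the small `eps` guard against division by zero is an addition made here, and the per-point bounds are passed in as precomputed values rather than derived from the sensor model.

```python
from typing import List, Sequence


def loop_closure_weight(distance_m: float, eps: float = 1e-6) -> float:
    """Weight a loop constraint by the inverse square of the keyframe distance."""
    return 1.0 / max(distance_m, eps) ** 2


def occlusion_mask(predicted_depths: Sequence[float],
                   observed_depths: Sequence[float],
                   bounds: Sequence[float]) -> List[bool]:
    """Flag points whose depth shift exceeds the per-point uncertainty bound."""
    return [abs(p - o) > b
            for p, o, b in zip(predicted_depths, observed_depths, bounds)]
```

How many flagged points are needed before a loop closure constraint is rejected is not specified in the text above, so the sketch only returns the per-point mask.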

On generation of a new keyframe, the back-end graph is updated with the previous keyframe information and a double window graph structure 200 is created. The pose graph in the back-end is optimized using, for example, the open source library g2o. A final optimization is performed on termination of the visual odometry to generate an optimized camera trajectory estimate.

Generally, RGB-D cameras project infra-red patterns and recover depth from correspondences between two image views with a small parallax. During this process, the disparity is quantized into sub-pixels. This introduces a quantization error in the depth measurement. The noise due to quantization error in depth measurement is defined as

$\begin{matrix} {{\eta \left( Z_{i} \right)} = {\frac{q_{pix}{bf}}{2}\left\lbrack {\frac{1}{{Rnd}\left( {\frac{q_{pix}{bf}}{Z_{i}} - 0.5} \right)} - \frac{1}{{Rnd}\left( {\frac{q_{pix}{bf}}{Z_{i}} + 0.5} \right)}} \right\rbrack}} & {{equation}\mspace{14mu} (3)} \end{matrix}$

where q_(pix) is the sub-pixel resolution of the device, b is the baseline, and f is the focal length. This error increases quadratically with range Z_(i), thus preventing the use of depth observations from far objects. The 3D sensor noise of RGB-D cameras can be modeled with a zero-mean multivariate Gaussian distribution whose covariance matrix has the following as the diagonal components:

$\begin{matrix} {{\sigma_{11}^{2} = {{\tan \left( \frac{\beta_{x}}{2} \right)}Z_{i}}},{\sigma_{22}^{2} = {{\tan \left( \frac{\beta_{y}}{2} \right)}Z_{i}}},{\sigma_{33}^{2} = {\eta \left( Z_{i} \right)}^{2}}} & {{equation}\mspace{14mu} (4)} \end{matrix}$

where the σ₃₃² direction is along the ray, and β_(x) and β_(y) denote the angular resolutions in the x and y directions.
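As a rough illustration of equations (3) and (4), the following sketch evaluates the quantization noise and the three diagonal variances for a given depth. Two assumptions are made here: Rnd is taken to mean rounding to the nearest integer, and the function and parameter names are illustrative only. Depths so large that the rounded disparity reaches zero are rejected.

```python
import math
from typing import Tuple


def _rnd(x: float) -> int:
    """Round to the nearest integer (assumed meaning of Rnd)."""
    return math.floor(x + 0.5)


def quantization_noise(z: float, q_pix: float, b: float, f: float) -> float:
    """Equation (3): depth noise caused by sub-pixel disparity quantization."""
    d = q_pix * b * f
    lower = _rnd(d / z - 0.5)
    upper = _rnd(d / z + 0.5)
    if lower <= 0:
        raise ValueError("depth is beyond the range of the quantization model")
    return 0.5 * d * (1.0 / lower - 1.0 / upper)


def sensor_variances(z: float, beta_x: float, beta_y: float,
                     q_pix: float, b: float, f: float) -> Tuple[float, float, float]:
    """Equation (4): diagonal entries of the covariance in ray coordinates."""
    eta = quantization_noise(z, q_pix, b, f)
    return (math.tan(beta_x / 2.0) * z, math.tan(beta_y / 2.0) * z, eta ** 2)
```

With hypothetical parameters such as q_pix = 8, b = 0.075 and f = 525 (values chosen here for illustration, not taken from this disclosure), the noise at 2 m comes out at roughly four times the noise at 1 m, in line with the quadratic growth noted above.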

FIG. 3 illustrates an RGB-D camera sensor model. The camera is located at the origin and is looking up in the z direction. For each range of 1, 2, and 3 meters, 80 points are sampled and their uncertainties are expressed with ellipsoids. The error in the ray direction increases quadratically.

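FIG. 4 is a block diagram of an uncertainty propagation. Each 3D point p_(i) in FIG. 4 is associated with a Gaussian distribution; the covariance matrices of the point and of its back-projected counterpart are Σ_(i) and Σ_(i)′, respectively,

$\begin{matrix} {{p\left( p_{i} \right)} = {\mathcal{N}\left( {p_{i},\Sigma_{i}} \right)}} & {{equation}\mspace{14mu} (5)} \end{matrix}$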

where

$\begin{matrix} {\sum_{i}{= {{R_{ray}\begin{bmatrix} \sigma_{11}^{2} & 0 & 0 \\ 0 & \sigma_{22}^{2} & 0 \\ 0 & 0 & \sigma_{33}^{2} \end{bmatrix}}R_{ray}^{T}}}} & {{equation}\mspace{14mu} (6)} \end{matrix}$

R_(ray) denotes the rotation matrix between the ray and camera coordinates.
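Equation (6) rotates the diagonal ray-aligned covariance into camera coordinates. A minimal sketch in plain Python is given below; the helper names are hypothetical, and the rotation matrix is supplied by the caller because its construction from the pixel ray is not detailed here.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two small dense matrices."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    """Transpose a small dense matrix."""
    return [list(col) for col in zip(*a)]


def point_covariance(r_ray: Matrix, variances: Sequence[float]) -> Matrix:
    """Equation (6): Sigma_i = R_ray * diag(s11^2, s22^2, s33^2) * R_ray^T."""
    diag = [[variances[r] if r == c else 0.0 for c in range(3)]
            for r in range(3)]
    return matmul(matmul(r_ray, diag), transpose(r_ray))
```

For example, a rotation of 90 degrees about the x axis exchanges the second and third variances, which matches the intuition that the large along-ray uncertainty follows the ray direction.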

A method of linearization is used to propagate the uncertainty to the residuals and the likelihood function can be expressed as a Gaussian distribution,

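$\begin{matrix} {{p\left( r_{i} \middle| \xi \right)} = {\mathcal{N}\left( {\mu_{i},\Sigma_{i}} \right)}} & {{equation}\mspace{14mu} (7)} \end{matrix}$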

where

$\begin{matrix} {\mu_{i} = {\begin{bmatrix} \mu_{i}^{I} \\ \mu_{i}^{Z} \end{bmatrix} = \begin{bmatrix} {{I_{2}\left( {\pi \left( {g\left( {{\overset{\_}{p}}_{i},\xi} \right)} \right)} \right)} - {I_{1}\left( x_{i} \right)}} \\ {{Z_{2}\left( {\pi \left( {g\left( {{\overset{\_}{p}}_{i},\xi} \right)} \right)} \right)} - \left\lbrack {g\left( {{\overset{\_}{p}}_{i},\xi} \right)} \right\rbrack_{Z}} \end{bmatrix}}} & {{equation}\mspace{14mu} (8)} \\ {\Sigma_{i} = {{J_{i}\Sigma_{i}J_{i}^{\top}} + {{diag}\left( {0,\left\lbrack \Sigma_{i}^{\prime} \right\rbrack_{3,3}} \right)}}} & {{equation}\mspace{14mu} (9)} \\ {J_{i}^{\top} = {\begin{bmatrix} {\nabla r_{i}^{I}} & {\nabla r_{i}^{Z}} \end{bmatrix} = \begin{bmatrix} \frac{\partial r_{i}^{I}}{\partial p_{i}} & \frac{\partial r_{i}^{Z}}{\partial p_{i}} \end{bmatrix}}} & {{equation}\mspace{14mu} (10)} \end{matrix}$

Here, [Σ_(i)′]_(3,3) denotes the variance of the back-projected point q_(i)′ in the z axis of the current camera coordinates as shown in FIG. 4. The maximum likelihood estimation is,

$\begin{matrix} {{\hat{\xi}}_{Sensor} = {\underset{\xi}{\arg\min}{\sum\limits_{i = 1}^{n}{r_{i}^{\top}\Sigma_{i}^{- 1}r_{i}}}}} & {{equation}\mspace{14mu} (11)} \end{matrix}$
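The propagation in equation (9) and a single term of the sum in equation (11) can be sketched as below. This is an illustration under assumptions: the Jacobian is given as a 2×3 list of rows (the derivatives of the photometric and geometric residuals with respect to the 3D point), the function names are hypothetical, and the 2×2 inverse is written out explicitly.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def propagate_residual_covariance(jac: Matrix, sigma_point: Matrix,
                                  var_z_backprojected: float) -> Matrix:
    """Equation (9): J * Sigma_point * J^T + diag(0, [Sigma']_33)."""
    js = [[sum(jac[r][k] * sigma_point[k][c] for k in range(3))
           for c in range(3)] for r in range(2)]
    cov = [[sum(js[r][k] * jac[c][k] for k in range(3))
            for c in range(2)] for r in range(2)]
    cov[1][1] += var_z_backprojected  # only the depth residual is affected
    return cov


def mahalanobis_sq(residual: Sequence[float], cov: Matrix) -> float:
    """One term r^T * cov^{-1} * r of equation (11), for a 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0.0:
        raise ValueError("residual covariance is singular")
    x, y = residual
    return (d * x * x - (b + c) * x * y + a * y * y) / det
```

Summing `mahalanobis_sq` over all valid pixels gives the cost that the maximum likelihood estimate minimizes over the camera pose.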

The individual precision matrix is split into two square roots, Σ_(i)⁻¹=Σ_(i)^(−1/2)Σ_(i)^(−1/2), and normalized by applying the single precision matrix of the weighted residuals, Σ⁻¹, as

$\begin{matrix} {{\hat{\xi}}_{Sensor} = {\underset{\xi}{\arg\min}{\sum\limits_{i = 1}^{n}{r_{i}^{\top}\Sigma_{i}^{{- 1}/2}\Sigma^{- 1}\Sigma_{i}^{{- 1}/2}r_{i}}}}} & {{equation}\mspace{14mu} (12)} \end{matrix}$

The photometric and geometric errors can be defined as,

$\begin{matrix} {r_{i} = {\begin{bmatrix} r_{i}^{I} \\ r_{i}^{Z} \end{bmatrix} = \begin{bmatrix} {{I_{2}\left( {\pi \left( {g\left( {{\pi^{- 1}\left( {x_{i},Z_{i}} \right)},\xi} \right)} \right)} \right)} - {I_{1}\left( x_{i} \right)}} \\ {{Z_{2}\left( {\pi \left( {g\left( {{\pi^{- 1}\left( {x_{i},Z_{i}} \right)},\xi} \right)} \right)} \right)} - \left\lbrack {g\left( {{\pi^{- 1}\left( {x_{i},Z_{i}} \right)},\xi} \right)} \right\rbrack_{Z}} \end{bmatrix}}} & {{equation}\mspace{14mu} (13)} \end{matrix}$

where Z_(i)=Z₁(x_(i)) and [·]_(Z) denotes the z component of the vector.

To find the relative camera pose that minimizes the photometric and geometric errors, the energy function is defined as the sum of weighted squared errors:

$\begin{matrix} {\hat{\xi} = {\underset{\xi}{\arg\min}{\sum\limits_{i = 1}^{n}{r_{i}^{\top}Wr_{i}}}}} & {{equation}\mspace{14mu} (14)} \end{matrix}$

where n is the total number of valid pixels, and W∈R^(2×2) denotes the weights for different errors.

Since the energy function is non-linear with respect to the relative camera pose ξ, the Gauss-Newton algorithm is usually applied to find the optimal solution numerically, and the estimate that minimizes equation (14) is updated iteratively as:

ξ_(k+1)=ξ_(k)+Δξ,(J ^(T)(I _(n) ⊗W)J)Δξ=−J ^(T)(I _(n) ⊗W)r  equation (15)

where ⊗ denotes the Kronecker product, r=(r₁, . . . , r_(n))^(T)∈R^(2n×1), and the Jacobian matrix is defined as

$\begin{matrix} {{J = {\begin{bmatrix} J_{1} \\ \vdots \\ J_{n} \end{bmatrix} \in {\mathbb{R}}^{2n \times 6}}},{J_{i} = {\begin{bmatrix} \frac{\partial r_{i}^{I}}{\partial\xi_{1}} & \ldots & \frac{\partial r_{i}^{I}}{\partial\xi_{6}} \\ \frac{\partial r_{i}^{Z}}{\partial\xi_{1}} & \ldots & \frac{\partial r_{i}^{Z}}{\partial\xi_{6}} \end{bmatrix} \in {\mathbb{R}}^{2 \times 6}}}} & {{equation}\mspace{14mu} (16)} \end{matrix}$
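One Gauss-Newton update of equation (15) can be sketched by accumulating the normal equations block by block, since J^T(I_n⊗W)J equals the sum of J_i^T W J_i over all pixels. The code below is a minimal sketch, not the claimed implementation: the function names are hypothetical, a small Gaussian elimination routine stands in for whatever linear solver an actual system would use, and the parameter dimension is taken from the Jacobian blocks so that the example can be exercised with fewer than six parameters.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def solve_linear(a: Matrix, b: Sequence[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def gauss_newton_step(jacobians: Sequence[Matrix],
                      residuals: Sequence[Sequence[float]],
                      weight: Matrix) -> List[float]:
    """Solve (J^T (I (x) W) J) delta = -J^T (I (x) W) r for delta."""
    m = len(jacobians[0][0])
    h = [[0.0] * m for _ in range(m)]
    g = [0.0] * m
    for j_i, r_i in zip(jacobians, residuals):
        wj = [[sum(weight[a][k] * j_i[k][c] for k in range(2))
               for c in range(m)] for a in range(2)]
        wr = [sum(weight[a][k] * r_i[k] for k in range(2)) for a in range(2)]
        for p in range(m):
            g[p] += sum(j_i[a][p] * wr[a] for a in range(2))
            for q in range(m):
                h[p][q] += sum(j_i[a][p] * wj[a][q] for a in range(2))
    return solve_linear(h, [-v for v in g])
```

In practice the step is repeated, with the residuals and Jacobians re-evaluated at the updated pose, until the increment becomes small.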

Eq. (14) is equivalent to maximum likelihood estimation where each residual is independent and follows an identical Gaussian distribution,

$\begin{matrix} {{\hat{\xi}}_{ML} = {\underset{\xi}{\arg\max}{\sum\limits_{i = 1}^{n}{\log \; {p\left( r_{i} \middle| \xi \right)}}}}} & {{equation}\mspace{14mu} (17)} \end{matrix}$

where p(r_(i)|ξ)=N(0, Σ). Note that this corresponds to the case of W=Σ⁻¹ in Eq. (14). When the residuals are instead modeled with a t-distribution, Eq. (17) can be rewritten as:

$\begin{matrix} {{\hat{\xi}}_{DVO} = {\underset{\xi}{\arg\min}{\sum\limits_{i = 1}^{n}{w_{i}r_{i}^{\top}\Sigma^{- 1}r_{i}}}}} & {{equation}\mspace{14mu} (18)} \end{matrix}$

where w_(i)=(v+2)/(v+r_(i)^(T)Σ⁻¹r_(i)) and v denotes the degrees of freedom of the t-distribution. Note that this corresponds to the case of W=w_(i)Σ⁻¹ in Eq. (14).
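The per-pixel weight used in equation (18) can be computed as in the sketch below. The function name is hypothetical, the precision matrix Σ⁻¹ is passed in as a 2×2 list, and the value of v used in the usage note is an assumption made for illustration only.

```python
from typing import List, Sequence


def t_distribution_weight(residual: Sequence[float],
                          sigma_inv: List[List[float]],
                          nu: float) -> float:
    """w_i = (nu + 2) / (nu + r_i^T Sigma^{-1} r_i) for a 2-vector residual."""
    x, y = residual
    quad = (x * (sigma_inv[0][0] * x + sigma_inv[0][1] * y)
            + y * (sigma_inv[1][0] * x + sigma_inv[1][1] * y))
    return (nu + 2.0) / (nu + quad)
```

With v = 5, for instance, a zero residual receives the largest weight and the weight decreases as the scaled residual grows, so outlying pixels contribute less to the pose estimate.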

A t-distribution is used for the photometric errors and a sensor model of a Gaussian distribution is propagated for the geometric errors. Combining Eq. (11) and Eq. (18) yields what is defined as σ-dense visual odometry (σ-DVO):

$\begin{matrix} {{\hat{\xi}}_{Hybrid} = {\underset{\xi}{\arg\min}{\sum\limits_{i = 1}^{n}{r_{i}^{\top}W_{i}^{1/2}\Sigma^{- 1}W_{i}^{1/2}r_{i}}}}} & {{equation}\mspace{14mu} (19)} \end{matrix}$

where the weight matrix W_(i)=diag(w_(i)^(I), w_(i)^(Z)) and

$\begin{matrix} {w_{i}^{I} = \frac{v + 1}{v + \left( \frac{r_{i}^{I}}{\sigma} \right)^{2}}} & {{equation}\mspace{14mu} (20)} \\ {w_{i}^{Z} = \frac{1}{{{\nabla r_{i}^{Z}}\Sigma_{i}^{- 1}{\nabla r_{i}^{Z}}^{\top}} + \left\lbrack \Sigma_{i}^{\prime} \right\rbrack_{3,3}}} & {{equation}\mspace{14mu} (21)} \end{matrix}$
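A single term of the hybrid cost in equation (19) can be sketched as follows. The photometric weight follows equation (20); the geometric weight is treated as an input that the caller has already evaluated from equation (21). The function names are hypothetical and the numbers in the usage note are illustrative only.

```python
import math
from typing import List, Sequence


def photometric_weight(r_intensity: float, sigma: float, nu: float) -> float:
    """Equation (20): t-distribution weight for the photometric residual."""
    return (nu + 1.0) / (nu + (r_intensity / sigma) ** 2)


def hybrid_cost_term(residual: Sequence[float], w_photo: float, w_geo: float,
                     sigma_inv: List[List[float]]) -> float:
    """One term r^T W^{1/2} Sigma^{-1} W^{1/2} r with W = diag(w_photo, w_geo)."""
    x = math.sqrt(w_photo) * residual[0]
    y = math.sqrt(w_geo) * residual[1]
    return (x * (sigma_inv[0][0] * x + sigma_inv[0][1] * y)
            + y * (sigma_inv[1][0] * x + sigma_inv[1][1] * y))
```

Because the weights enter through their square roots on both sides of Σ⁻¹, a diagonal Σ⁻¹ reduces each term to a weighted sum of the squared photometric and geometric residuals.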

The σ-DVO algorithm can be implemented in any suitable client device, such as a smart phone, tablet, mobile phone, personal digital assistant (PDA), or any other device. Returning to FIG. 1, the SLAM system 100 with the integrated σ-DVO algorithm uses a smaller number of keyframes owing to reduced drift in the system. A reduced number of keyframes means lower computational requirements in the back-end of the system.

FIG. 5 illustrates an example of a map generated by a σ-DVO SLAM system 100. As can be seen, a consistent trajectory is generated using the σ-DVO SLAM system 100.

The embodiments described above have been shown by way of example, and it should be understood that these embodiments may be susceptible to various modifications and alternative forms. It should be further understood that the claims are not intended to be limited to the particular forms disclosed, but rather to cover all modifications, equivalents, and alternatives falling within the spirit and scope of this disclosure.

Embodiments within the scope of the disclosure may also include non-transitory computer-readable storage media or machine-readable medium for carrying or having computer-executable instructions or data structures stored thereon. Such non-transitory computer-readable storage media or machine-readable medium may be any available media that can be accessed by a general purpose or special purpose computer. By way of example, and not limitation, such non-transitory computer-readable storage media or machine-readable medium can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store desired program code means in the form of computer-executable instructions or data structures. Combinations of the above should also be included within the scope of the non-transitory computer-readable storage media or machine-readable medium.

Embodiments may also be practiced in distributed computing environments where tasks are performed by local and remote processing devices that are linked (either by hardwired links, wireless links, or by a combination thereof) through a communications network.

Computer-executable instructions include, for example, instructions and data which cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. Computer-executable instructions also include program modules that are executed by computers in stand-alone or network environments. Generally, program modules include routines, programs, objects, components, and data structures, etc. that perform particular tasks or implement particular abstract data types. Computer-executable instructions, associated data structures, and program modules represent examples of the program code means for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps.

While the patent has been described with reference to various embodiments, it will be understood that these embodiments are illustrative and that the scope of the disclosure is not limited to them. Many variations, modifications, additions, and improvements are possible. More generally, embodiments in accordance with the patent have been described in the context of particular embodiments. Functionality may be separated or combined in blocks differently in various embodiments of the disclosure or described with different terminology. These and other variations, modifications, additions, and improvements may fall within the scope of the disclosure as defined in the claims that follow.

What is claimed is:
 1. A method for computing visual Simultaneous Localization and Mapping (SLAM) comprising: generating, by a visual odometry module, a local odometry estimate; generating, by a keyframe generator, keyframes; creating a keyframe graph; adding a constraint to the keyframe graph using a loop constraint evaluator; and optimizing the keyframe graph with trajectory.
 2. The method of claim 1 further comprising: generating a new keyframe between a keyframe and a current frame before generating a local odometry estimate.
 3. The method of claim 2 wherein adding the constraint to the keyframe graph using the loop constraint evaluator is based on a loop closure; wherein the loop closure is the return to previously visited locations.
 4. The method of claim 3, further comprising adjusting a pose graph based on edge weights of different constraints in the keyframe graph after optimization.
 5. A method of applying a probabilistic sensor model for a dense visual odometry comprising: generating, by a keyframe generator, keyframes; creating a keyframe graph; adding a constraint to the keyframe graph using a loop constraint evaluator; and optimizing the keyframe graph with trajectory.
 6. The method of claim 5 further comprising: generating a new keyframe between a keyframe and a current frame before generating a local odometry estimate.
 7. The method of claim 6 wherein adding the constraint to the keyframe graph using the loop constraint evaluator is based on a loop closure; wherein the loop closure is the return to previously visited locations.
 8. The method of claim 7, further comprising adjusting a pose graph based on edge weights of different constraints in the keyframe graph after optimization.
 9. A method of t-distribution for photometric errors and a probabilistic sensor model for geometric errors comprising: ${\hat{\xi}}_{Hybrid} = {\underset{\xi}{\arg\min}{\sum\limits_{i = 1}^{n}{r_{i}^{\top}W_{i}^{1/2}\Sigma^{- 1}W_{i}^{1/2}r_{i}}}}$
 10. A visual SLAM system comprising: a plurality of keyframes including a keyframe, a current keyframe, and a previous keyframe; a dual dense visual odometry configured to provide a pairwise transformation estimate between two of the plurality of keyframes; a frame generator configured to create a keyframe graph; a loop constraint evaluator configured to add a constraint to the keyframe graph; and a graph optimizer configured to produce a map with trajectory.